Hi, everyone. My name is Chris Freeland and I'm the librarian at the Internet Archive. So here we are today at our final session of the Library as Laboratory series. Over the past 10 weeks, 10 weeks if you can believe it, we've brought together some of the world's leading scholars in the digital humanities and data-intensive sciences to talk about their projects and how they're using the Internet Archive in the course of their research. We started off our series by asking the bold question: what can you do with billions of archived web pages? And two weeks ago, we rounded out the formal part of our series by asking: what can you do with 60 million digitized pages of scientific literature? In between, we've heard from bibliographers, authors, educators, archivists, and data scientists about how they're using materials, services, and infrastructure from the Internet Archive. The recordings for all of those sessions are now available online, and I see that Duncan, who's working behind the scenes today along with Caitlin (hello both of you, and thank you), has shared that out into the chat.

But we're going to do something a little different today. When we started planning this series, we had inquiries from a number of scientists who wanted to do shorter presentations, something more like a status update or an overview of their research, not a 45-minute lecture with a long Q&A. So we put out a call for lightning talks. And today we're going to bring you a variety of short talks and videos from researchers working on topics as varied as understanding the effects of emotional cues by the chairs of the U.S. Federal Reserve on financial markets to a DIY book scanning robot. So here's the game plan for today.

Live transcripts and captions are available; use the live transcript feature of Zoom to turn those on. You can also copy anything from within the chat by mousing over the individual item you're interested in, like the links that we're going to share. And we're going to capture all of the links in the chat and make those available, including the video; all the resources will be available.
So if you're trying to catch something, if you want to remember something, copy a link that we've shared, rest assured that it will make it into the email we're going to send you tomorrow with the video for today's session and the other resources that we've shared. As you can see, the chat is open; please do be respectful and keep the comments on topic, and use the Q&A feature to submit questions for our panelists. We'll have time for probably one question per speaker, so please do submit questions for us to gather from. And a final thing I want to mention about time: I anticipate that we're definitely going to run over an hour today, probably more towards 90 minutes. So for those of you who need to depart at the top of the hour, rest assured that we're going to record all of this and it will be made available to everyone. For now, I can see people are already using the chat. Please do say hello, and let us know who you are and where you're joining from today.

I was actually just looking back over my planning notes, and we started talking about what became this Library as Laboratory series last September. Like many ideas at the Internet Archive, it burst forth with enthusiasm and gusto in a meeting with Brewster Kahle, the founder and digital librarian of the Internet Archive. So I'd like to welcome Brewster to the screen. Brewster, I'd like to have you share a bit of your thinking: why was organizing this Digital Humanities Expo a priority for you and for the Internet Archive?

Thank you, Chris. This is really fulfilling the dream of the Internet Archive. We started by trying to archive the Internet, and how do you go and do that, and then expanded to archiving other things. But it wasn't just to make it so that you can go and find old web pages; it was to try to get a bigger view. Can we help make it so that lots of new and different things can happen without your having to build your own collection yourself? Build a library where you can just have your new idea and all of the materials needed for your research are on the shelves, now just digital shelves. That was sort of the impetus for the Internet Archive in general. And so this idea of having people be able to use these collections at scale has been so important, but actually it's really pretty hard.
The collections are, I think, in pretty good shape, but they're just huge, hard to find your way around, and hard to figure out how to use. But we have seen fabulous presenters, including the ones presenting today. And there's the urgency of misinformation, the feeling that our information ecosystem is out of our control. I think we really need people with a macroscope. That's Jesse Ausubel's line: he said we got really far with the microscope towards understanding, and humans and scientists and science got much further along; now we need a bigger picture of what's going on. Can we use web-scale, TV-scale, book-scale kinds of information to try to help us understand our world? I'm jazzed that many of the key players have come forward to speak about what they're doing. They're so inspiring, inspiring to me, and they're inspiring to our staff, but I think to others as well, to go and say: yes, I can use these sorts of collections at scale to do different things. So what's the value to the Internet Archive in this? From a completely selfish point of view, it helps drive us forward to make more useful collections and tools. And for that, we're going to need your feedback on what the Internet Archive can do to be a better library. What are the materials, the tools, the structures or platforms that you would like to see? Is this Library as Laboratory series useful to you? We've seen a large number of people join in on these sessions. So please, feedback; that's absolutely essential. And thank you, Chris, for making all of this come about with a tremendous group of researchers. Looking forward to today.

Thanks for that, Brewster. Always good to hear from you and hear what you're thinking about. So as we jump in here today, I want to let all of you in the audience know what to expect. We have three segments of talks, and each segment will have three to four talks, counting both talks and videos. We'll move from talk to talk quickly, and we'll wrap up each segment with a quick round of questions and answers. So if you have questions, again, drop them into the Q&A; as I said, we'll probably have time to ask one question aloud per speaker. But our speakers are going to be hanging around, and if you drop off additional questions, they might be interested in engaging with you further and answering some of those questions from the Q&A.
So please, as you have questions, do ask them using the Q&A feature. I'll also mention that Duncan dropped a link into the chat with the agenda, so that you can follow along with the order we're going in today. So let's get started. Up first today is Kate Miltner from the University of Edinburgh. Kate's going to tell us about the forgotten histories of the mid-century coding boot camp. Over to you, Kate.

Hi, everyone. Thanks so much for joining. I'm just going to share my screen and we will get this party started. Okay. So as Chris mentioned, I'm Kate Miltner. I am a Marie Curie postdoctoral fellow at the University of Edinburgh, and I'm really grateful to Chris and the Internet Archive for having me to talk about some of the amazing artifacts that I've used in my historical research on electronic data processing schools, which are what I consider to be the mid-century predecessor of the contemporary coding boot camp.

In the past decade, news stories like these have become increasingly familiar. I'm sure most, if not all, of you have seen or read an article that talks about learning to code and what it can accomplish. Maybe you even wrote one of them. As part of my PhD thesis, I read over 200 articles over a 10-year period that talked about learning to code. In reading all of these articles, it became clear that coding has been positioned as a solution to a variety of interconnected social issues, including concerns about AI, gender and racial bias in the tech sector, economic inequality, and skills gaps that have supposedly left millions of well-paid positions unfilled due to a lack of appropriate training.

In response to this discourse, an entire industry of coding boot camps has developed in the US and across the world. Coding boot camps are short-term intensive courses that aim to make professional technologists out of technical novices. The topics of these programs can vary: some are focused on data science, some on user experience, some on software engineering; there are even programs for digital marketing and product management. The length of time can vary too. Most are around three to four months, but some last as long as two years. But independent of the topic and the timeline, the promises that these programs make are usually the same: stick with us and the tech career of your dreams will be right within reach. It's a pretty compelling promise.
Across the US and Canada alone, the coding boot camp industry trained almost 25,000 people a year in 2020, pulling in almost 350 million dollars. One of the claims made by coding boot camps is that they offer a novel solution for addressing some of the major problems within tech education. First, they're a lot shorter than a four-year college degree. Second, they're supposed to have a much lower sticker price. Third, they're supposed to be more accessible to groups that are traditionally excluded from the tech industry. And finally, they're more agile than a university, which is supposed to make them responsive to the needs of industry. With all of these factors combined, they're meant to be a new and ideal way to respond to the challenges of what some call the fourth industrial revolution.

But as any tech scholar will tell you, the likelihood of something tech-related from the current moment being completely brand new is pretty low. So I set out to find the historical roots of the coding boot camp, and what I was able to find was surprising, even to me. As I was reading through some work in the history of computing, I came across Nathan Ensmenger's fantastic book about the software industry in the mid-20th century. In it, he wrote about the prevalence of electronic data processing schools, or EDP schools, in the 1960s and 70s. EDP schools were vocational schools that offered short-term courses in computer programming, and much like coding boot camps, these privately run schools aimed to train people for jobs in industry.

After reading about EDP schools, I started looking for some primary historical materials about them. Based on the sources available to me at the university where I did my PhD, I was able to find a few newspaper articles. I also found ads for one franchise of schools in particular, the Electronic Computer Programming Institute, or ECPI. In the ads especially, I began to see some similarities between the sales pitches of EDP schools and the sales pitches of coding boot camps, especially around how easy programming can be and the number of available jobs in the computing industry. I was hoping to find some more materials about EDP schools in traditional archives like the Charles Babbage Institute at the University of Minnesota.
But because most EDP schools were either regional, short-lived, or both, the finding aids for those archives suggested that there wasn't a lot of archival material to be found. But then, thanks to the junk mail collection of a computing pioneer named Ted Nelson and the wonders of digitization, I stumbled upon something that I thought I'd never find: an original recruitment brochure from ECPI from the late 1960s. The ECPI booklet was only 16 pages long, but it was a crucial piece of evidence in my historical research that allowed me to make important links between the past and present of computing education.

It was really remarkable how many similarities there were between how EDP schools and coding boot camps framed themselves and what they have to offer. First, both kinds of organizations framed a career in computing as accessible to anyone and as a good way to guard against the threat of automation. EDP schools also focused on the wide availability of computing jobs, much like contemporary discussions of the skills gap. They also presented computing as an inclusive career pathway for women and people of color, which was actually pretty remarkable for the 1960s: out of the 10 students showcased for successful placements, over half were women or people from minoritized ethnic groups. Finally, the ECPI booklet underscored its links with industry, much like coding boot camps do today.

Of course, the image that an organization presents to the world and its reality can be starkly different. So I then began to look into whether the image that ECPI presented connected with other parts of the historical record. To do this, I turned to another fantastic Internet Archive resource, the digitized collection of Computerworld. To have this as a searchable digital archive was really incredible, because it contained some remarkably rich material that I probably wouldn't have found otherwise. One of the key directions that the Computerworld archive encouraged me to pursue in my research was the disconnect between the claims of inclusivity made about the computing industry at the time and the reality of the computing labor market. There was a ton of coverage in Computerworld about EDP schools in the 1960s and 70s, and a lot of it highlighted the many issues with these organizations. A series of articles published from 1969 to 1970 showed how those claims of accessibility were not necessarily true.
Despite a purported programmer shortage, these articles illustrated how Black programming trainees found it next to impossible to get hired for programming jobs. This was a very different story than the one told by the ECPI booklet. What the Computerworld archive also highlighted was how problematic so many of these schools were. In Nathan Ensmenger's book, he discussed how many companies had adopted a policy of not hiring EDP school graduates due to the variable quality of these schools. Articles that I found in Computerworld gave clear insight into how these organizations operated and the dire situations that some of their students were left in.

So you may be asking yourself: okay, what does the history of EDP schools in the 1960s and 70s have to do with today? There's a reason that this history is worth paying attention to, and that's because it threatens to repeat itself in the current moment. The ECPI booklet showed us that there are a lot of similarities between how coding boot camps and EDP schools presented both themselves and the potential benefits of learning to code. But the Computerworld archive shows us that there may be some other similarities as well. The story of EDP schools didn't end well, either for many of the students or for the schools themselves. Many were shut down and liquidated by the mid-1970s after a federal investigation cracked down on fraudulent vocational schools; others just went out of business because no one wanted to go anymore. Of course, the future of coding boot camps has yet to be written, but there are some warning signs that there might be more similarities between boot camps and EDP schools than we might hope. One of my goals for this project was to point out how the past lives on in the present, so that the mistakes of the past can hopefully be avoided.

If you've enjoyed this talk, I have a journal article coming out this fall in Information & Culture that discusses the commonalities between coding boot camps and EDP schools at length, so please do keep an eye out. Thanks so much for your attention. And if you'd like to get in touch or put questions in the Q&A, please do.

Thanks so much, Kate. Really appreciate that excellent overview. We have some questions that have come in, and I'm gathering those. If others in the audience have questions, please do use the Q&A feature to drop those off.
What we'll do now is move on to the next talk. Up next is Tom Gally from the University of Tokyo. Now, Tom is in Japan, and he will be watching this time-shifted. So this is a wave to Tom in the future; he will be watching this in the past. I'm getting a little confused, but let's let Tom explain some of his research about Japan in this video. Caitlin, do you want to take it away?

Japan is now a modern country, not too different from many other places. But when it opened up to the world in the middle of the 19th century, Japan seemed exotic and mysterious to the first Western visitors. People who couldn't visit were even more curious about the country. So to satisfy that curiosity, and to make some money, those visitors wrote books about Japan for the people back home. Japan As They Saw It is a collection of excerpts from those books. They show how those visitors, mostly American or British, described the country and its people: the cities, the countryside, the clothing, the religions, the theater and festivals, even the smells. They tell about riding in rickshaws, attending weddings, meeting prostitutes, experiencing earthquakes.

I got the idea to create Japan As They Saw It after I came across some of those old books at the Internet Archive. You see, I myself moved to Japan in 1983 from California, and I've lived here ever since. So it was fascinating for me to think about how my own first impressions of the country, and what I had told my friends and family back home about it, compared with what people had written about Japan a century earlier. I thought a collection of those earlier impressions might be interesting for others to read too. So I first prepared a list of over 240 books about Japan at the Internet Archive, published between 1855 and 1912. I went through those books and chose passages that seemed interesting or amusing, and I put them all on this website. It's also available as an e-book. Each excerpt is linked to the original book at the Internet Archive. And like those books at the Internet Archive, Japan As They Saw It is free for anyone in the world to read. Thank you once again, Internet Archive, for making this possible.

And thanks to you, Tom, for putting that video together. And if anyone has a question, yes.
We've already reached out to Tom and asked him if he would create other videos for other parts of our collection, because I thought that was really great. Also, stay tuned: Tom is going to close out our show with another video on forgotten novels of the 19th century, which is really fun research. But up next, we have one of my favorite people on the planet, a colleague who I've worked with for years in a variety of ways, and I'm really pleased to welcome him to the stage virtually to tell us about his research. That's Rod Page from the University of Glasgow, talking about the bibliography of life. Over to you, Rod.

Okay. Thank you very much for that lovely introduction, Chris. So I'm going to talk about the bibliography of life, and I guess what I should do first is define what I mean by that. A number of people who work in the field of biodiversity and taxonomy have this dream of a bibliography of life: basically, access to every taxonomic paper published on every species ever described. To give you a sense of the scale, we think there are probably about two million species described on the planet today, and probably about 10 million in total, so there are lots still to be discovered. So we're thinking in terms of hundreds of thousands, perhaps even a million or so publications that describe these species.

Now, sometimes when we talk about this notion of a bibliography of life, some people get upset because it seems to focus on taxonomy, and what's so special about taxonomy? What about all these major biomedical databases such as PubMed, with lots of information on medicine and so on? I want to make the case that there is something special about taxonomy. Taxonomy and biodiversity, in many ways, are not big data; they're long data. We have lots and lots of long tails. What you can see in the slide here on the right is a little summary of the size of Wikipedia pages for different mammal species. Some mammals, for example lions and mice, charismatic animals or medically important animals, have really large Wikipedia entries. Down at the bottom of this chart you can see there are literally thousands of Wikipedia pages on mammals that are very, very small. So for many species on the planet, almost all that we know about them, in terms of their ecology, their morphology, what they do, where they are, will come from the taxonomic literature.
Now, taxonomy itself is also very much subject to a long tail. We have some very large, prominent journals, such as Zootaxa, that publish tens of thousands of new species descriptions. But there is also a very, very long tail, in many cases of very small journals that are often quite niche in terms of their taxonomic or geographic focus. And again, they will be a reservoir of lots of information about these species. So if we're going to get a nice, comprehensive bibliography of life, we're going to have to go hunting for those.

Now, those of you who, I guess like myself and Chris, have been around for a while in this area might be thinking: hang on a second, this sounds familiar. What about the Biodiversity Heritage Library, the BHL? Isn't this what they're doing? And we had a really interesting presentation a couple of weeks ago about them. Well, there is some overlap, but BHL suffers from what I'm going to call the Mickey Mouse gap, which is the huge dampening effect that copyright has had on BHL's coverage. What I'm trying to capture in this diagram: in blue, you can see for each decade how many publications are out there describing new species of animals, just a sort of lower bound on that, and in red are the articles in BHL. You can see that the coverage in BHL dips dramatically after about 1923, when copyright in America kicks in, and then it starts to increase a bit. But there's this big area in blue, and that's the stuff that I'm after: all these descriptions of species, mostly in the 20th century, that aren't in BHL.

So one solution is to try and capture these articles and put them somewhere safe. A lot of this material has actually been digitized and is freely available on various websites, so I spent some time trying to gather it together, and I've created almost a mini BHL on the Internet Archive collecting these publications. They're full of beautiful photographs of species, also information on geography and maps and climate, and also the people who study these species. So that's one approach: just try and gather as many of these publications as possible. I briefly talked about the long tail of taxonomic publications, these small journals. Many of these journals are as endangered as species, or indeed as taxonomists themselves.
There are journals that simply vanish, like the one on the left, a recent Italian journal that's just gone. There are also journals that vanish but come back as zombies, taken over by bad actors. On the right is a journal that used to publish on butterfly taxonomy. The person who ran it eventually gave up; it was too hard to maintain a taxonomic journal. Somebody else came along, took that domain name, and is now publishing anything but butterfly taxonomy. And this is where the Wayback Machine has been incredibly useful, to try and retrieve these old journals and their content, and also to discover the history of some of these journals which have been hijacked.

So part of this exercise is just getting the content. One thing that, in a sense, the Internet Archive probably isn't particularly great at is metadata. You can put quite a bit in, but for really detailed metadata I've turned to Wikidata. Wikidata is an extraordinary resource: it has a great editing interface, and there's an enormous community of people contributing. So, for example, if you have an article like this one that's in English and Chinese, you can have titles in both languages. But what really motivates my use of Wikidata is trying to connect all these things together. I guess what I'm aiming for ultimately is something a bit like this; there's a link, which I think is also going to be in the show notes. This is a little app that I built that essentially talks to Wikidata, and you can look at information, say, on species on the left. You can get information on the journals that publish papers on these species, and you can also then bounce to the taxonomists, the people actually doing the research. So the goal is to have this kind of network of linked things: images of species, journal articles, and the taxonomists themselves. And the little thumbnails you can see in these pictures are all coming from the Internet Archive, so these are articles that are freely available, where people can read and see pictures of these species.

So where next for this project? I guess I'm fairly simple-minded in a way. One part is just to try and get more content, and get as much of that content digitally preserved as possible, with the Internet Archive being one obvious route for that, and to use the Wayback Machine to try and retrieve some of these journals that have come and gone. And the second stage is just to try and link all these things together, so we don't just have these digital artifacts sitting there preserved in some way; we can actually bounce between them and explore those connections. Thanks very much for your time.
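The kind of linking Rod describes can be explored directly against Wikidata's public SPARQL endpoint. The sketch below is not Rod's app, just a minimal illustration with an arbitrary example taxon, showing how articles, the journals they appeared in, and their authors can be pulled for one taxon using Wikidata's taxon name (P225), main subject (P921), published in (P1433), and author (P50) properties.

```python
# Minimal sketch (not Rod's actual app): pull articles, journals, and authors
# linked to one example taxon from Wikidata's public SPARQL endpoint.
# Requires: pip install requests
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?articleLabel ?journalLabel ?authorLabel WHERE {
  ?taxon wdt:P225 "Panthera leo" .            # P225 = taxon name (example)
  ?article wdt:P921 ?taxon .                  # P921 = main subject
  OPTIONAL { ?article wdt:P1433 ?journal . }  # P1433 = published in
  OPTIONAL { ?article wdt:P50 ?author . }     # P50 = author (linked item)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 25
"""

response = requests.get(
    SPARQL_ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "bibliography-of-life-sketch/0.1"},
)
response.raise_for_status()

for row in response.json()["results"]["bindings"]:
    article = row.get("articleLabel", {}).get("value", "?")
    journal = row.get("journalLabel", {}).get("value", "")
    author = row.get("authorLabel", {}).get("value", "")
    print(f"{article} | {journal} | {author}")
```

Rod's app presumably builds on queries of this general shape, layering species images and Internet Archive article links on top of the results.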
Thank you so much, Rod. If you would stay on screen, and if Kate would come back on, I would love to do a quick round of questions. I have a question for each of you, and again, if there are additional questions, please do drop them in the Q&A. For Kate, the question is: how much stuff is in Ted Nelson's junk mail collection, and how did you search through it? It seems like finding a needle in a haystack.

So actually, I have to thank Google for that one. I was just looking for information on EDP schools; I think it was just Google. I was trying to find information on ECPI, and because the guide had been OCR'd, it showed up. And I just remember being like, what is this junk mail collection? What is even happening? And then I clicked through and I was like, this is a miracle. Holy cow, I can't believe this whole thing exists. And beyond the actual booklet itself, there were other materials, including some news articles that had come with the packet, saying things like: look how great learning computer programming is, and you're going to make so much money, all of that. So I have not actually seen enough of the junk mail collection, but I should definitely go spelunking one day.

I bet many of the researchers here in the audience have had the same experience. I know I have: doing a Google search for something kind of obscure in your discipline and, lo and behold, there it is somewhere at the Internet Archive. I would never have thought to go look in the Ted Nelson junk mail collection for this thing that I needed. Thank you for that, Kate. A question for you, Rod; we'll pick this one up from Martin Kalfatovic. Now, I know that you're not a copyright lawyer, but I bet you have some opinions on what Martin is asking, which is that current global copyright laws are terrible for science and for culture, but people go to jail or worse for violating them.
Do you have any thoughts on how institutions can manage that risk?

Well, that's a question. I guess one thing to point out: most of the content that I'm going after, a lot of it is, quote unquote, freely available, and some of it is explicitly freely licensed. So there's an element of, if your model is to negotiate with the providers and sign agreements to get the content, which I guess is the way BHL does it, and which is all proper, that mitigates the risk. I've taken the approach that if it's not behind a paywall and they're not expressly saying do not take this, then we can at least try and make copies of it so that it doesn't go away. And again, we've seen repeatedly that by the time we could have had those negotiations, some of these journals are just gone. So I take Martin's point, and I guess that's an interesting question for the Internet Archive itself, as an institution that presumably tries to mitigate the risk of dealing with this issue.

And Lila Bailey, the policy expert at the Internet Archive, has entered the chat and dropped off some thoughts, as did Brewster, so we'll let that conversation continue asynchronously. Kate, Rod, thank you so much for your presentations today. If you have any questions for them, drop them into the Q&A, and we'll pick up with our next speaker. I would like to welcome Michelle Alexopoulos to the screen. Michelle is going to tell you about her research looking at the expressions of the U.S. Federal Reserve Board chairs and how those expressions affect financial markets. So over to you, Michelle.

Thank you very much. I would like to thank the organizers, obviously, for putting together such an interesting and eclectic group of papers. As Chris said, I want to talk about more than words. This is an analysis of the Federal Reserve Board chairs' communication during congressional testimony, based on a project that I'm doing with members of the Bank of Canada, specifically Xinfen Han, Oleksiy Kryvtsov, and Xu Zhang. The link to our newly released working paper is available there. Now, in terms of the project's data sources, this wouldn't have happened unless we had access to a tremendous amount of resources; we're looking for audio and video as well as text. From the Internet Archive, we've primarily used the TV Archive content.
These are videos from C-SPAN, CNBC, Bloomberg TV, and the House and Senate feeds. And when there are holes in that coverage, we've turned to the C-SPAN archives, YouTube videos, and textual sources that give us transcript information. What we did is analyze all of this and then blend it with tick-by-tick stock data that came from places like Wharton and Refinitiv.

The overview of the project is pretty straightforward. What we wanted to do is examine the importance and the impact of US Federal Reserve communication. We measure aspects of the communication using AI, natural language processing, and other tools, and then we test if and how the words chosen, the body language, and the tone of voice actually affect markets. When I talk about monetary policy and policy communications, and I understand we have a very large audience, I'm talking about things such as interest rate movements for the policy rate, forward guidance, quantitative easing, or bank balance sheet operations. These things have very large impacts on the welfare of businesses and individuals in our economy, and because the United States is so large, around the world as well. Now, these policies are often fairly complex. To be effective, central bank communications need to be accurate, they need to be clear, they need to be perceived as credible, and they need to reach their target audience. And when the central bank takes actions or releases communications, we are not all necessarily watching the actual release; we're also getting our information through media. So we have two different types of channels where we get this information. It could be direct communication, such as the press conferences themselves, watching things like the testimony, or the press releases, such as the FOMC statements that were released last week. Or we can have the indirect types of communication that we often see captured by the Internet Archive's TV archive, which covers CNBC, Fox News, or things like Bloomberg TV. This is then also amplified by more traditional outlets such as the New York Times, and by Twitter.

Now, when we think about communication, communication is obviously more than words, and this idea has been popular for a very long time, ever since the 1970s.
I'm sure some of you have heard the statement that over 90% of communication comes from body language and tone of voice. Although the Mehrabian studies probably don't get the numbers exactly right, the basic idea is pretty widely accepted. You can see here a few pictures of Ben Bernanke and Janet Yellen, two of the Federal Reserve chairs, and you can see that sometimes they look very relaxed and sometimes they look a lot more perplexed or concerned. So we want to think about how that actually influences people.

Why does body language potentially matter? Body language and tone of voice could matter through a direct channel: a trader, or individuals like us, may be watching directly and then take actions based on what we're hearing, either because we're actively looking for something or because it affects our sentiment or confidence. There are also indirect channels, because these communications get picked up and amplified through TV stations and other news outlets: the Fed communicates, journalists or analysts watch it, and then they release information. Those of us receiving that information, which may or may not be a game of broken telephone, then take actions on it ourselves. You can see here, for example, a quote from CNBC's Street Signs talking about when Janet Yellen looked insecure, when she looked frustrated, when she looked angry, and even when Ben Bernanke's voice got shaky at various times. And then there's the impact of algorithmic traders, an automated channel that may be taking a lot of these signals, analyzing them, and trading on their behalf.

So what our research tries to do is look at these different channels; we're not going to take a stand on which one is most important. We look at Fed chair testimonies, and this is a little different from some of the other papers that have looked at Fed communications, because we're focusing on these 32 testimonies, which we will expand over the summer, and we're looking at the three different dimensions of communication all at once. Now, what's interesting about these communications is that there's a prepared portion, and there's unscripted question and answer.
People can obviously respond to those things in different ways. We will also have the responses from the senators and the congressional representatives. Now, why testimonies, you might ask? Well, first of all, they're widely watched by Fed watchers and investors, and they're covered by a lot of news media. They don't occur on the same day as an actual policy announcement, so you can focus a little bit more on disentangling just the communications component. And of course, the high-quality TV footage and transcripts available from the archives have made this actually doable at the moment.

As for the structure of the semiannual testimony: these things are about two to three hours long, and they come in pairs, one before the House and one before the Senate Banking Committee. You have the prepared remarks released before things begin, then some welcoming statements, the prepared remarks, and the Q&A. What we can do then is disentangle the different components: we have transcripts, which give us the actual words that are spoken; we have the audio component; and we have the video component, which allows us to look at how facial expressions are formed. So here are the three different channels. From the text, without getting into details, we're using a fine-tuned BERT model, a language model that is sort of state of the art right now. From the voice, we're doing forced alignment with the transcripts and extracting pitch for each speaker with a tool called Praat. And for the facial expressions, we're using tools that would be used in psychology laboratories, FaceReader for example, and Microsoft Azure, and we're matching these to look at facial action units and then using mappings that psychologists have developed to figure out what kinds of negative emotions might be being expressed by the Federal Reserve chairs during this time. Now, all in all, we take all of this and we then have to mesh it with financial market data, and we have to be very careful about the timing. So we use the CNBC live coverage, and we watch what the values of the S&P 500 are, in order to do some of that crosswalk on the timestamps.
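As an illustration of the voice channel Michelle describes, the pitch extraction that Praat performs can be approximated in a few lines with the praat-parselmouth Python bindings. This is an editor's sketch rather than the authors' pipeline, and the audio file name is a placeholder for a pre-segmented clip of a single speaker.

```python
# Sketch of the voice channel: a per-clip pitch (F0) summary using the
# praat-parselmouth bindings to Praat (pip install praat-parselmouth).
# "testimony_clip.wav" is a hypothetical, pre-segmented clip of one speaker.
import parselmouth

sound = parselmouth.Sound("testimony_clip.wav")
pitch = sound.to_pitch(time_step=0.01)            # pitch track every 10 ms

frequencies = pitch.selected_array["frequency"]   # Hz; 0.0 where unvoiced
voiced = frequencies[frequencies > 0]

print(f"mean F0:       {voiced.mean():.1f} Hz")
print(f"F0 range:      {voiced.min():.1f}-{voiced.max():.1f} Hz")
print(f"voiced frames: {len(voiced)} of {len(frequencies)}")
```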
Then what we do is basically create this giant database, and we use local projection regressions and difference-in-differences-type projections to look at the impact of these different channels of emotion on the outcomes, which would be, say, the change in the S&P 500 or what's happening with the VIX.

What we typically find, and this is just one of the examples, is that yes, indeed, the S&P 500 and the VIX tend to respond in ways that you might expect. So if the chair is actually relaxed and happy, not negative, you see the values of the stock market going up in the short run and the value of the VIX, which is usually considered a fear index, going down. You also see differences across topics and, as you might expect, the ones the Fed has the most oversight or control over, such as monetary policy, have the largest impact on markets over the time period.

Now, to summarize the findings today, because I did promise Chris I would try to finish on time: what we found is that these different dimensions don't all act at the same time in a highly correlated fashion. Text, voice, and facial emotions, though, do tend to move financial markets. I showed you a couple of examples of the short-run responses, but the impact seems to grow stronger in the days following these testimonies. There are differences in the amount of soft information and how it differs across topics, and we're also starting to see some patterns, which we'll be verifying over the summer, as to whether some of these responses differ across Fed chairs. We also have the congressional members' emotions, and sometimes those will significantly move markets as well. All right, so with that, I say thank you. Comments and questions are welcome; again, there's a link to the working paper, and you can reach out to me via email. I'll be happy to have a conversation. Thank you very much.
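The local projection regressions Michelle mentions can be sketched as a loop of one regression per horizon. The code below is an editor's illustration rather than the authors' specification; the column names, horizon range, and error structure are all assumptions made for the example.

```python
# Sketch of a Jordà-style local projection: one OLS per horizon h, regressing
# the h-step-ahead change in a market outcome on an emotion measure at time t.
# Column names and settings are hypothetical. Requires: pip install pandas statsmodels
import pandas as pd
import statsmodels.api as sm

def local_projection(df: pd.DataFrame, shock: str, outcome: str, max_h: int = 5) -> pd.Series:
    """Estimated response of `outcome` to `shock` at horizons 0..max_h."""
    irf = {}
    for h in range(max_h + 1):
        y = df[outcome].shift(-h) - df[outcome].shift(1)   # change from t-1 to t+h
        X = sm.add_constant(df[[shock]])
        fit = sm.OLS(y, X, missing="drop").fit(
            cov_type="HAC", cov_kwds={"maxlags": h + 1}    # robust SEs for overlapping horizons
        )
        irf[h] = fit.params[shock]
    return pd.Series(irf, name=f"response of {outcome} to {shock}")

# Usage sketch: `df` holds one row per time step, e.g. an "sp500" level and a
# "negative_face" facial-emotion score aligned on the same timestamps.
# irf = local_projection(df, shock="negative_face", outcome="sp500")
```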
He's going to walk you 542 00:41:25,001 --> 00:41:30,000 through some of the work that he's doing on metadata behind the scenes with our 543 00:41:30,000 --> 00:41:33,000 web archive. Sawood, please, over to you. 544 00:41:33,001 --> 00:41:36,001 Hi. Allow me to share my screen. 545 00:41:40,000 --> 00:41:46,001 It's taking a while. Okay, that's the top one. 546 00:41:50,001 --> 00:41:55,000 Is my screen visible? Looks great. Awesome. Thank you. 547 00:41:55,001 --> 00:41:58,000 I'm Sawood Alam, a web and data scientist on the 548 00:41:58,000 --> 00:42:00,000 Wayback Machine team at the Internet Archive. 549 00:42:01,000 --> 00:42:06,000 What you see on my screen is the help text of a tool called CDX Summary. 550 00:42:07,000 --> 00:42:12,001 In my current working directory, I have a file called sample.cdx.gz, which is a 551 00:42:12,001 --> 00:42:18,001 compressed index of a bunch of web archive files. But we don't know what was in 552 00:42:18,001 --> 00:42:24,001 those web archive files, or WARC files. Hopefully, when we run this index through 553 00:42:24,001 --> 00:42:28,001 our CDX Summary tool, we'll get to know what is inside of there. I'm going to run 554 00:42:28,001 --> 00:42:33,000 this command here, and it will take a while to complete. So we'll come back to it 555 00:42:33,000 --> 00:42:39,000 later. Let's go to the Internet Archive homepage. Here we see a bunch of 556 00:42:39,000 --> 00:42:44,001 collections listed. Some are collections of books and audio and video. Some are 557 00:42:44,001 --> 00:42:50,001 web collections. Let's open one of these collections. In this case, it is the 558 00:42:50,001 --> 00:42:56,000 Ukrainian cultural heritage collection. You scroll through this collection. You 559 00:42:56,000 --> 00:43:02,000 see some interesting colorful thumbnails, some meaningful titles, and you have a 560 00:43:02,000 --> 00:43:08,000 rough idea of what to see or to get in this collection. Now, let's go to a web 561 00:43:08,000 --> 00:43:13,001 collection, which is a collection of WARC files. An arbitrary number of web 562 00:43:13,001 --> 00:43:19,001 captures are packaged together. And in this case, we see lists of tombstones here 563 00:43:19,001 --> 00:43:26,000 as thumbnails. And the titles look very similar. I guess they were generated by 564 00:43:26,000 --> 00:43:31,001 machines. It doesn't tell a lot. There is an About tab in this collection view, 565 00:43:31,001 --> 00:43:33,000 but we don't see much. 566 00:43:33,001 --> 00:43:38,000 There are a few lines of metadata here and that's all. And this is the problem 567 00:43:38,000 --> 00:43:44,001 that we are going to solve here. In the next few days or weeks, when the testing 568 00:43:44,001 --> 00:43:50,000 is done and the changes are merged to the main site, this page will look 569 00:43:50,000 --> 00:43:57,000 something like this. We'll get to know how many captures are in this 570 00:43:57,000 --> 00:44:02,001 collection, how many unique URIs were captured, how many of those URIs belong to, 571 00:44:02,001 --> 00:44:09,001 say, HTML pages with 200 OK responses, for example, or how many of those URLs have 572 00:44:09,001 --> 00:44:16,000 zero path and zero query segments, that is, root pages versus deep links. We can also get 573 00:44:16,000 --> 00:44:17,001 a temporal spread of the collection. 574 00:44:18,000 --> 00:44:24,001 So this collection started in 2016 and stopped there. No activity in 2017, 575 00:44:24,001 --> 00:44:30,000 and then it started again in 2018.
And since then, it has been constantly archiving 576 00:44:30,000 --> 00:44:35,000 every single month. Then we also have a bunch of top hosts that are contributing 577 00:44:35,000 --> 00:44:37,001 to this collection with the maximum number of captures. 578 00:44:38,000 --> 00:44:45,000 And one cool thing that the tool brings up is a bunch of sample URIs from this 579 00:44:45,000 --> 00:44:51,000 collection. So WARC files bundle up a number of URIs, and you don't know what 580 00:44:51,000 --> 00:44:55,001 is inside them. And this allows us to click on one of these links and see how 581 00:44:55,001 --> 00:45:00,000 well those pages were archived. So it helps us do QA, for example. All these 582 00:45:00,000 --> 00:45:06,000 numbers also allow us to gain insights, and sometimes, if the 583 00:45:06,000 --> 00:45:10,000 numbers are really off, we learn that there was some problem in our crawling 584 00:45:10,000 --> 00:45:14,001 process and go back and fix it. So this whole page is basically rendered from, 585 00:45:14,001 --> 00:45:20,001 backed by, a JSON file. And this JSON file is generated using a command line tool 586 00:45:20,001 --> 00:45:25,001 called CDX Summary. And we have open sourced this tool. So you can use this 587 00:45:25,001 --> 00:45:32,000 tool to generate human readable summaries or machine readable JSON files. And if 588 00:45:32,000 --> 00:45:37,001 you have a JSON file, you can render it in HTML using this web component that we 589 00:45:37,001 --> 00:45:42,001 made available. So with that, let's go back to the command line and see what 590 00:45:42,001 --> 00:45:48,000 happened to the local CDX file that we had. And it turned out it did generate a 591 00:45:48,000 --> 00:45:54,000 very nice human readable summary of the CDX index that we had. Now we get to know 592 00:45:54,000 --> 00:45:58,000 what is inside in there, when it was captured, and how many captures are there, 593 00:45:58,000 --> 00:46:03,001 and so on and so forth. And we can even click on one of these links to open them 594 00:46:03,001 --> 00:46:08,000 in a browser and see how it was archived. But by demonstrating it on the command 595 00:46:08,000 --> 00:46:12,001 line, what we really illustrated is that this tool is not tied to the Internet Archive. 596 00:46:13,001 --> 00:46:17,000 It is an independent tool, actually; we are just using it at the Internet Archive. 597 00:46:17,001 --> 00:46:22,000 Anyone can use it on their own web archive collections. So with that, I will 598 00:46:22,000 --> 00:46:29,000 recap. We basically created a command line tool and a companion web component 599 00:46:29,000 --> 00:46:35,000 to summarize web collections. We released these tools under an open source 600 00:46:35,000 --> 00:46:40,001 license. And we use these tools to enrich our own web collections. 601 00:46:41,001 --> 00:46:47,001 We gain insights from the generated stats to improve our crawler. Finally, we 602 00:46:47,001 --> 00:46:52,000 demonstrated that these tools are not exclusive to the Internet Archive. So 603 00:46:52,000 --> 00:46:56,001 anyone can use them on their own web archive collections and learn more about 604 00:46:56,001 --> 00:47:01,001 what they have. With that, I will see you in the Q&A. And thank you so much for 605 00:47:01,001 --> 00:47:07,000 listening. Thank you so much, Sawood. So we've got, you know, lots of comments and 606 00:47:07,000 --> 00:47:12,000 questions coming in in the chat.
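For anyone curious what such a summary involves under the hood, here is a minimal sketch that computes a few similar statistics directly from a gzipped CDX index. It is not the CDX Summary tool itself, and it assumes the common space-separated CDX layout, with the 14-digit timestamp, original URL, MIME type, and status code in fields two through five.

```python
# Illustrative sketch (not the CDX Summary tool): basic stats from a gzipped CDX index.
import gzip
from collections import Counter

captures = 0
urls = set()
html_ok = 0
years = Counter()

with gzip.open("sample.cdx.gz", "rt", errors="replace") as fh:
    for line in fh:
        parts = line.split()
        if len(parts) < 5 or not parts[1][:4].isdigit():
            continue  # skip header or malformed lines
        captures += 1
        urls.add(parts[2])          # original URL
        years[parts[1][:4]] += 1    # capture year from the timestamp
        if parts[3].startswith("text/html") and parts[4] == "200":
            html_ok += 1

print(f"captures: {captures}, unique URLs: {len(urls)}")
print(f"HTML 200 OK captures: {html_ok}")
print("captures per year:", dict(sorted(years.items())))
```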
So I think we will have a good question for you 607 00:47:12,000 --> 00:47:18,001 in just a minute. But up next, the final bit of this segment is a 608 00:47:18,001 --> 00:47:23,001 presentation, a video from Art Rhyno. And I had the chance to watch this 609 00:47:23,001 --> 00:47:27,000 video. And I think it is going to blow your mind. It is a little Willy Wonka and 610 00:47:27,000 --> 00:47:33,000 just charming and wonderful. So Art is going to show you in this video kind 611 00:47:33,000 --> 00:47:39,001 of the Internet Archive's scanning frame, the Tabletop Scribe (TT) scanner, and his DIY automation of it. 612 00:47:40,000 --> 00:47:42,000 So, Caitlin, let's roll the video. 613 00:47:45,001 --> 00:47:48,001 Hi, there. My name is Art Rhyno. I work at the University of 614 00:47:48,001 --> 00:47:50,000 Windsor in Ontario, Canada. 615 00:47:51,000 --> 00:47:55,001 And I'm on the board of a nonprofit organization called OurDigitalWorld, or ODW. 616 00:47:56,000 --> 00:48:00,000 My library acquired a very nice Tabletop Scribe scanning machine from 617 00:48:00,000 --> 00:48:01,001 the Internet Archive years ago. 618 00:48:01,001 --> 00:48:06,000 This unit currently has good availability. And I've become interested in scanning 619 00:48:06,000 --> 00:48:10,000 the back file of a major paper collection. A lot of the material is well 620 00:48:10,000 --> 00:48:14,001 positioned for scanning; some of it can be disassembled and the pages can be fed 621 00:48:14,001 --> 00:48:19,001 through our Xerox scanner. However, there's a lot of stitch binding. I've 622 00:48:19,001 --> 00:48:23,001 been trying to find a way to scan it without compromising the binding. Here are 623 00:48:23,001 --> 00:48:27,000 the components I've used to pull together automatic scanning, most of which were 624 00:48:27,000 --> 00:48:33,001 already on hand from ODW's work on DIY microfilm scanners. One new addition is 625 00:48:33,001 --> 00:48:38,001 an inflatable snow tube, which came from the end-of-season table at a local 626 00:48:38,001 --> 00:48:43,001 hardware store. The air mattress pump inflates the snow tube and a vacuum cleaner 627 00:48:43,001 --> 00:48:48,000 holds on to the page. The device in the front that looks like a car is the mBot, 628 00:48:48,000 --> 00:48:54,000 or Makeblock mBot. My mBot is several years old, and I use the mBlock IDE on the 629 00:48:54,000 --> 00:48:58,001 web for controlling it. There are wireless and Bluetooth options for newer models 630 00:48:58,001 --> 00:49:02,001 that would be more elegant for this kind of thing. However, this worked well 631 00:49:02,001 --> 00:49:06,001 enough and it seems to do the trick for what we want to do. Probably the biggest 632 00:49:06,001 --> 00:49:10,001 drawback of all this is the noise from the air mattress pump and the vacuum 633 00:49:10,001 --> 00:49:15,000 cleaner, which you can't hear here. But otherwise, I'm pretty happy with this 634 00:49:15,000 --> 00:49:19,000 approach. It takes about 30 to 45 minutes to scan the typically 635 00:49:19,000 --> 00:49:20,001 100 pages in a title. 636 00:49:20,001 --> 00:49:25,001 You can find way more sophisticated and faster devices on YouTube. And it would 637 00:49:25,001 --> 00:49:29,001 be cool to make something more general purpose and robust. But this is one 638 00:49:29,001 --> 00:49:33,001 approach for automatically scanning material on the Tabletop Scribe that is 639 00:49:33,001 --> 00:49:36,001 hopefully useful for others. Thanks for listening.
640 00:49:39,001 --> 00:49:44,000 So I have it on good authority that Art does a lot of really interesting 641 00:49:44,000 --> 00:49:50,001 projects. And as you can see, he's a little bit of a tinkerer, and just what 642 00:49:50,001 --> 00:49:54,001 creativity. I think that someone mentioned like a rotisserie chicken turning 643 00:49:54,001 --> 00:50:00,001 thing next to a microfilm scanner. Anyway, interesting stuff there. 644 00:50:01,000 --> 00:50:06,001 I would like to welcome, or I'd like to bring, Michelle and Sawood back to the 645 00:50:06,001 --> 00:50:11,001 screen for a couple of questions. And so the first question that I have, if you 646 00:50:11,001 --> 00:50:17,001 want to turn your cameras on, Michelle, the question that I have is for you. Are 647 00:50:17,001 --> 00:50:21,001 there data points or related data that you'd like to include in your research 648 00:50:21,001 --> 00:50:26,000 that simply weren't or aren't available in digital or computable form? 649 00:50:27,000 --> 00:50:32,001 So that's a great question. There are holes in the collection. Unfortunately, the 650 00:50:32,001 --> 00:50:37,001 Internet Archive hasn't gone back quite far enough for what we'd love to do, 651 00:50:37,001 --> 00:50:42,001 which is kind of go back sort of to the beginning to include a lot more of the 652 00:50:42,001 --> 00:50:49,000 data from, say, Greenspan's time period or even the chairs before 653 00:50:49,000 --> 00:50:54,000 that. And sometimes some of the audio files and some other forms of 654 00:50:54,000 --> 00:50:58,000 communications that could have happened around the time. So some of that's not 655 00:50:58,000 --> 00:51:02,001 available. Some of it we're still trying to dig through some archives to see whether or 656 00:51:02,001 --> 00:51:07,001 not it's been put someplace else or sort of misplaced. But 657 00:51:07,001 --> 00:51:11,001 there is the unfortunate thing, and I mean, it's why the Internet Archive is so 658 00:51:11,001 --> 00:51:16,000 important, that things have gotten destroyed over time. So we would really like to 659 00:51:16,000 --> 00:51:20,001 have it as inclusive as possible. We'd also like to look at some of the issues 660 00:51:20,001 --> 00:51:25,000 between Janet Yellen and other female Fed authorities and whether or not there are 661 00:51:25,000 --> 00:51:29,001 different responses to them. And it's just sometimes the video feeds and 662 00:51:29,001 --> 00:51:33,001 everything like that just aren't there right now. So those would be the primary 663 00:51:33,001 --> 00:51:37,000 ones. We're getting much better at getting stock data and everything coming in. 664 00:51:37,001 --> 00:51:42,000 So that's not quite so much of a problem. Fascinating. Thanks so much for that. 665 00:51:42,001 --> 00:51:46,000 Sawood, a question for you. The comment is, thank you for making 666 00:51:46,000 --> 00:51:48,000 your CDX tool open source. 667 00:51:49,000 --> 00:51:52,000 What other projects would you envision using your tool? 668 00:51:54,000 --> 00:52:00,000 Ah, interesting. So one immediate application I was thinking of the other day is, 669 00:52:00,000 --> 00:52:06,001 there is an emerging format called WACZ, which is like 670 00:52:06,001 --> 00:52:11,001 bundling a bunch of WARC files with their indices and all that stuff. I think 671 00:52:11,001 --> 00:52:16,000 this tool can integrate well in there. So if someone kind of loads a WACZ 672 00:52:16,000 --> 00:52:21,000 file in a browser, they can see what really is inside.
And it has been a 673 00:52:21,000 --> 00:52:26,001 challenge, like, I mean, Mark Graham, director of the Wayback Machine, he often 674 00:52:26,001 --> 00:52:31,001 asked me, okay, so here is this item we collected, what is inside in there? And 675 00:52:31,001 --> 00:52:35,001 you know, you just run the tool against it, and you will have like a bunch of 676 00:52:35,001 --> 00:52:41,000 URLs that you can play with and explore from there. Again, I mean, this gives both 677 00:52:42,001 --> 00:52:47,001 statistical insights, as well as a kind of exploratory view. Another option I've been 678 00:52:47,001 --> 00:52:54,001 looking forward to is to use these random URLs that we pull 679 00:52:54,001 --> 00:53:00,000 out from these. And those are not just random; they have some other criteria to 680 00:53:00,000 --> 00:53:05,001 them. We can use those to generate thumbnails and have like a slideshow or 681 00:53:05,001 --> 00:53:11,001 something like that to have a more visual understanding of what's inside a 682 00:53:11,001 --> 00:53:16,000 collection and use it as a way to kind of tell a story about a collection. 683 00:53:17,001 --> 00:53:23,001 Fantastic. Thank you for that explanation. I understand from a message from 684 00:53:23,001 --> 00:53:26,001 Michael Nelson in chat that you're going to be teaching a course next semester. 685 00:53:27,001 --> 00:53:32,001 Oh, yes, it will be on web server design. So yeah, my students will be competing 686 00:53:32,001 --> 00:53:38,001 with NGINX and Apache and some other fancy web servers, hoping to have 687 00:53:38,001 --> 00:53:45,001 a web server that is more, you know, compliant with the RFCs and standards than 688 00:53:45,001 --> 00:53:52,001 some of these other well-known web servers. Well, good luck 689 00:53:52,001 --> 00:53:58,001 to you in the course, and your students are in excellent hands for sure. 690 00:53:59,000 --> 00:54:03,001 Michelle, Sawood, thank you both so much for your presentations today. And we'd like to 691 00:54:03,001 --> 00:54:07,001 keep moving through today's talks. If you have additional questions 692 00:54:07,001 --> 00:54:14,000 for Michelle or Sawood, please do drop them into the Q&A. But now I'd like to welcome 693 00:54:14,000 --> 00:54:20,000 Spencer Torin to the screen from Thomson Reuters Special Services, and 694 00:54:20,000 --> 00:54:24,000 Spencer is going to talk about the work that he's done with automatic hashtag 695 00:54:24,000 --> 00:54:29,000 hierarchy generation. And I can't wait to hear that. So Spencer, over to you. 696 00:54:32,000 --> 00:54:38,001 Yep. Okay, hello. 697 00:54:39,000 --> 00:54:45,001 Welcome, everyone. Yeah, so I'd like to talk about, well, what Chris just said. 698 00:54:46,000 --> 00:54:51,001 So the idea is, you know, we have all these tags, hashtags, on Twitter. And the 699 00:54:51,001 --> 00:54:57,000 question is, do they have a hierarchy, is there structure to them? And I'll talk a 700 00:54:57,000 --> 00:55:01,000 little bit about the difference between ontologies and folksonomies really 701 00:55:01,000 --> 00:55:06,000 quickly. So we're all probably familiar with ontologies, if not by name, at least 702 00:55:06,000 --> 00:55:11,000 by concept. So we have things like the dictionary, shall we say. In the 703 00:55:11,000 --> 00:55:15,001 dictionary, there are very curated terms, there are rigorous definitions of these 704 00:55:15,001 --> 00:55:19,000 terms. And there are, you know, well-defined relations between these terms.
So an 705 00:55:19,000 --> 00:55:23,001 example is, a foot has five toes and is part of the human body. There are many 706 00:55:23,001 --> 00:55:27,001 more definitions of what a foot is, as you can see on the right, and I think on the 707 00:55:27,001 --> 00:55:32,000 right is just some of them. But we have to compare that with the difficulty of 708 00:55:32,000 --> 00:55:39,000 looking at, let's say, Twitter, in a 709 00:55:39,000 --> 00:55:45,000 folksonomy. So these hashtags that are used are very arbitrary. Their definitions are 710 00:55:45,000 --> 00:55:52,000 circumstantial. There are very undefined relationships to them. I just created 711 00:55:52,000 --> 00:55:55,001 two hashtags down there. And if I use them on Twitter, 712 00:55:55,001 --> 00:55:57,001 they would now exist in the record. 713 00:55:58,001 --> 00:56:03,000 What do some of those hashtags even mean? I don't even know, really; I just put 714 00:56:03,000 --> 00:56:08,000 up some nonsense just to illustrate the fact that I could do it. So, you know, if 715 00:56:08,000 --> 00:56:13,000 we want to put some structure to these hashtags, we have to be able to ask at 716 00:56:13,000 --> 00:56:16,000 least the minimal question, which is, you know, are there hashtags which are more 717 00:56:16,000 --> 00:56:21,001 general than others, shall we say? Is hashtag dog more general than Dalmatian? If 718 00:56:21,001 --> 00:56:26,000 dog, in our ontological thinking, is more general than Dalmatian, is that 719 00:56:26,000 --> 00:56:30,000 true on Twitter? Is hashtag AI more general than hashtag machine learning? 720 00:56:30,001 --> 00:56:37,000 What does general mean on Twitter? An ontology would say that a word is general 721 00:56:37,000 --> 00:56:42,000 perhaps due to the breadth of applicability of the term. And that again would 722 00:56:42,000 --> 00:56:47,001 be by the definition. There's nothing to say that you could or should use a term 723 00:56:47,001 --> 00:56:52,001 very often. But by the definition of the term, you could say that the term is 724 00:56:52,001 --> 00:56:58,000 general. With folksonomies, since there's no real clear definition for these 725 00:56:58,000 --> 00:57:02,000 things, we probably have to take the question in a different direction and talk 726 00:57:02,000 --> 00:57:06,000 about the breadth of application rather than the breadth of applicability. So, 727 00:57:06,000 --> 00:57:09,000 we're going to ask certain contextual questions, like, you know, when was it 728 00:57:09,000 --> 00:57:15,001 used? Where was it used? Who used it? How was it used? So, we look at contexts as 729 00:57:15,001 --> 00:57:19,000 proxies for generality. So, there's a classic one, 730 00:57:19,000 --> 00:57:20,001 which would be the popularity context. 731 00:57:21,000 --> 00:57:25,000 If a lot of people are using it, that might indicate it's for general use. There 732 00:57:25,000 --> 00:57:31,001 are lots of contexts. Here are examples. Hashtag disease would not be 733 00:57:31,001 --> 00:57:36,001 related to a specific event, although COVID-19 is. It's a big deal. It's a big 734 00:57:36,001 --> 00:57:42,000 event. But it's nonetheless an event. And in some years, probably we'll see that 735 00:57:42,000 --> 00:57:47,000 hashtag die out. You can have a general holiday, which could happen any time of 736 00:57:47,000 --> 00:57:51,000 the year. Halloween is generally one time of year. You could have hashtag work, 737 00:57:51,001 --> 00:57:55,000 which occurs throughout the week.
TGIF might be a Friday thing. And you could 738 00:57:55,000 --> 00:58:00,001 certainly have the hashtag food any time of the day, and breakfast maybe more so 739 00:58:00,001 --> 00:58:06,000 in the morning. And then the all-important semantic contexts. So, you know, if 740 00:58:06,000 --> 00:58:11,001 you have a hashtag that co-occurs with many other different hashtags, this is 741 00:58:11,001 --> 00:58:15,000 sort of the state of the art at the moment, you know, which other, how many other 742 00:58:15,000 --> 00:58:20,000 different unique hashtags is a hashtag used with. So, if very 743 00:58:20,000 --> 00:58:21,001 many, then we could call it general. 744 00:58:21,001 --> 00:58:25,000 And if not very many, then maybe not so general. And then we can also look at 745 00:58:25,000 --> 00:58:29,000 topics, which you might consider to be groups of hashtags. Is it used 746 00:58:29,000 --> 00:58:30,001 within a group or outside the group? 747 00:58:31,001 --> 00:58:38,001 And also tokens and words. So, is it used with many different ideas or not? And 748 00:58:38,001 --> 00:58:42,000 the way we measure all this is basically with what is 749 00:58:42,000 --> 00:58:45,001 called the Shannon diversity index. This is an ecologically 750 00:58:45,001 --> 00:58:51,001 inspired term. It is just Shannon entropy. For those of you who are not familiar 751 00:58:51,001 --> 00:58:56,000 with Shannon entropy, I will briefly describe it. So, on the left, you see 752 00:58:56,000 --> 00:59:01,000 something with very low entropy, very low diversity. You kind of know what you're 753 00:59:01,000 --> 00:59:05,000 getting with that thing, where it fits, what context it's used in. And then over 754 00:59:05,000 --> 00:59:09,000 on the very far right, you would have no idea when and where you might see this 755 00:59:09,000 --> 00:59:14,001 thing. So, you might consider it to be very diverse. And then, 756 00:59:14,001 --> 00:59:17,001 you know, I mentioned the eight different contexts a few slides 757 00:59:17,001 --> 00:59:22,000 previously. So you have eight of these diversity measures for 758 00:59:22,000 --> 00:59:26,000 each hashtag. And then you just multiply them by some weighting and you come up 759 00:59:26,000 --> 00:59:31,000 with this ensemble diversity index. And that value would be an indication of how 760 00:59:31,000 --> 00:59:35,001 general that hashtag is. The higher the value, the more general it is. The lower 761 00:59:35,001 --> 00:59:38,000 the value, the less general it is, the more specific. 762 00:59:39,000 --> 00:59:41,001 So, I'll get to the data, the all-important data. So, you know, we went to the 763 00:59:41,001 --> 00:59:48,001 Archive for it. We used Twitter's 1% Spritzer stream. And we got 52 months 764 00:59:48,001 --> 00:59:54,000 of data between October 2016 and December 2021: 46 million English language 765 00:59:54,000 --> 01:00:00,000 tweets. We have many more than that in other languages. And 360,000 hashtags. So, 766 01:00:00,000 --> 01:00:04,000 we took all that data. We boiled it down into a hashtag network, the co-occurring 767 01:00:04,000 --> 01:00:10,001 hashtags. And then we used this network to calculate a lot of our measures. So, 768 01:00:10,001 --> 01:00:17,000 here's some of it, you know, here's the data community, or shall we say the top 769 01:00:17,000 --> 01:00:23,001 10 most diverse hashtags within the data community.
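As an aside on the ensemble diversity index Spencer describes, here is a minimal sketch of the underlying calculation: Shannon entropy of a hashtag's usage distribution in each context, combined with a weighting. The usage counts and the equal weights below are made up for illustration; they are not the paper's actual data or weighting.

```python
# Illustrative sketch of an ensemble diversity index: Shannon entropy of a
# hashtag's usage distribution in each context, then a weighted average.
import math

def shannon_entropy(counts):
    """Shannon entropy (in nats) of a discrete usage distribution."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

# Hypothetical usage counts for one hashtag across two contexts:
contexts = {
    "hour_of_day": [5, 3, 8, 2, 7, 6, 4, 9],  # spread across hours -> higher diversity
    "co_hashtags": [40, 1, 1, 1],             # mostly one co-occurring tag -> lower diversity
}
weights = {name: 1 / len(contexts) for name in contexts}  # equal weights (assumed)

diversity = sum(weights[name] * shannon_entropy(c) for name, c in contexts.items())
print(f"ensemble diversity index: {diversity:.3f}")
```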
There are over 2 770 01:00:23,001 --> 01:00:27,001 ,000 other hashtags in this community alone. But you can see at the top, you have 771 01:00:27,001 --> 01:00:33,001 pretty familiar words there that you may recognize. And then if we look across 772 01:00:33,001 --> 01:00:39,000 other communities of hashtags. So, here's the data community, the beer community, 773 01:00:39,000 --> 01:00:43,000 the coffee community, and then the dog community. At the top, you know, the most 774 01:00:43,000 --> 01:00:47,001 diverse, the most general hashtags are very familiar terms. You would recognize 775 01:00:47,001 --> 01:00:51,001 almost all of them. And then at the bottom, the least diverse hashtags, the most 776 01:00:51,001 --> 01:00:56,000 specific, you know, I've never seen these before. They look like they sort of 777 01:00:56,000 --> 01:01:02,001 make sense in the community. But they're very narrowly used. And there's a 778 01:01:02,001 --> 01:01:09,000 beautiful tale here, based on our findings. Hashtag love happens to be 779 01:01:09,000 --> 01:01:15,000 the most diverse hashtag, according to our method, according to the archive.org 780 01:01:15,000 --> 01:01:19,000 data that we have. So, that's a beautiful story there. And you can see some of 781 01:01:19,000 --> 01:01:24,001 the related communities to the love community. These are the most 782 01:01:24,001 --> 01:01:30,000 tightly corresponding communities. And, you know, not shown are thousands of 783 01:01:30,000 --> 01:01:34,000 other communities and hundreds of thousands of other hashtags. But you can 784 01:01:34,000 --> 01:01:39,000 certainly, you know, dive into it. So, what can this be used for? It can be used as a 785 01:01:39,000 --> 01:01:44,000 hashtag recommender. So, if hashtag Dalmatian, maybe also hashtag dog. But the 786 01:01:44,000 --> 01:01:49,000 idea of this general-to-specific ordering suggests that you don't want to do 787 01:01:49,000 --> 01:01:53,001 the reverse: if hashtag dog, maybe also hashtag Dalmatian doesn't really work. You can 788 01:01:53,001 --> 01:01:58,000 look at hierarchical hashtag topic models. So, we know that hashtag AI is part of 789 01:01:58,000 --> 01:02:01,001 the data community. So, that's, you know, good information to know. We also know 790 01:02:01,001 --> 01:02:06,000 that it is the data community. You can do social science investigations. What does 791 01:02:06,000 --> 01:02:10,001 hashtag Minnesota mean, and when did it mean that? It certainly was more 792 01:02:10,001 --> 01:02:16,000 geographical prior to May of 2020, with the murder of George Floyd in Minneapolis, 793 01:02:16,000 --> 01:02:21,001 when Minnesota became associated with the Black Lives Matter movement. And you 794 01:02:21,001 --> 01:02:26,001 can use other platforms too. We showed Twitter. But, you know, any tags will do 795 01:02:26,001 --> 01:02:31,000 for this. So, thank you for your time. I appreciate it. And thank you to the 796 01:02:31,000 --> 01:02:35,001 presenters again. And if you're interested, these are the names of the relevant papers. 797 01:02:35,001 --> 01:02:37,000 Thank you. 798 01:02:38,000 --> 01:02:43,001 Thank you so much, Spencer. We have some questions that are coming in. And if you 799 01:02:43,001 --> 01:02:47,000 in the audience have additional questions, please do drop them into the Q&A. The 800 01:02:47,000 --> 01:02:53,000 one thing that I was just struck by is that data, beer, coffee, dog hashtags are 801 01:02:53,000 --> 01:02:58,001 life. So, hold tight there for a second.
Before we go on, I want to acknowledge, 802 01:03:00,000 --> 01:03:04,001 I'm remiss in not acknowledging Art Rhyno, who is in the audience today, for that 803 01:03:04,001 --> 01:03:09,001 fantastic video that we just watched. Art is here watching along with us. So, 804 01:03:09,001 --> 01:03:15,001 thank you, Art, for that submission and for your creative work. Up next in our 805 01:03:15,001 --> 01:03:20,001 show is another video. And another presenter who is also in the audience. So, 806 01:03:21,000 --> 01:03:25,000 we're going to hear from Jim Salmons, who is going to tell us about his Internet 807 01:03:25,000 --> 01:03:30,001 Archive-enabled journey as a digital humanities citizen scientist. So, Caitlin, 808 01:03:30,001 --> 01:03:36,001 let's roll. Hi, I'm Jim Salmons. And this lightning talk is a whirlwind trip 809 01:03:36,001 --> 01:03:41,000 through my post-cancer journey of rebirth as a digital humanities citizen 810 01:03:41,000 --> 01:03:46,000 scientist and how the inspiration of and access to the Internet Archive made that 811 01:03:46,000 --> 01:03:52,001 possible. Starting in 2012 through 2014, both my wife, Timlynn Babitsky, and I had 812 01:03:52,001 --> 01:03:57,001 terrifying cancer battles that we fortunately survived. To pay it forward in 813 01:03:57,001 --> 01:04:03,000 celebration of our 25th wedding anniversary, we funded the digitization of the 48 814 01:04:03,000 --> 01:04:07,001 issues of the Apple computer-focused Softalk magazine into the Internet 815 01:04:07,001 --> 01:04:13,001 Archive. Softalk has a special place in my heart as I was a reader, advertiser, 816 01:04:14,000 --> 01:04:19,000 writer, and during the time of its explosive growth, an executive at Softalk 817 01:04:19,000 --> 01:04:23,000 Publishing, where I designed and helped develop the software that ran the back 818 01:04:23,000 --> 01:04:28,000 office production and advertising processes. As part of our initial Softalk 819 01:04:28,000 --> 01:04:32,000 preservation project, Timlynn and I went to the Midwest Scanning Center of the 820 01:04:32,000 --> 01:04:37,001 Archive, where we saw firsthand and participated in the amazing behind-the-scenes 821 01:04:37,001 --> 01:04:43,000 scanning service that puts digital collections into the Internet Archive. Not 822 01:04:43,000 --> 01:04:47,000 wanting our contribution to end with simply getting the digital edition of 823 01:04:47,000 --> 01:04:51,000 Softalk into the Archive, we followed up with activity that was the true beginning 824 01:04:51,000 --> 01:04:55,000 of my rebirth as a digital humanities citizen scientist. 825 01:04:55,001 --> 01:05:00,001 As I learned about the challenges of text and data mining of digital collections 826 01:05:00,001 --> 01:05:05,000 within the cultural heritage domain, I decided to create a ground truth storage 827 01:05:05,000 --> 01:05:10,001 format that would support an integrated model of a magazine's complex document 828 01:05:10,001 --> 01:05:12,001 structures and content depiction. 829 01:05:13,001 --> 01:05:19,000 From 2015 through 2019, I developed a personal learning network of largely EU- 830 01:05:19,000 --> 01:05:24,000 and UK-based mentors and collaborators to support the development of the magazine 831 01:05:24,000 --> 01:05:30,001 GTS (ground truth storage) metadata format, based on international museum ontology standards.
With 832 01:05:30,001 --> 01:05:36,000 posters and papers accepted to EU-based digitization conferences and workshops, 833 01:05:36,001 --> 01:05:42,000 my work initially focused on the structure and content of advertisements in 834 01:05:42,000 --> 01:05:48,000 Softalk. Everything came to a crashing halt when I suffered a devastating spinal cord 835 01:05:48,000 --> 01:05:54,000 injury in 2020. During the year and a half of my rehab and recovery, the digital 836 01:05:54,000 --> 01:05:57,001 humanities domain saw explosive growth in the use of 837 01:05:57,001 --> 01:05:59,000 machine learning technologies. 838 01:06:00,000 --> 01:06:05,000 As I reinvigorate my research, I have expanded my focus from the Softalk 839 01:06:05,000 --> 01:06:09,001 collection to consider the challenges of investigating the massive Internet 840 01:06:09,001 --> 01:06:14,001 Archive collection of computer magazines, consisting of tens of thousands of 841 01:06:14,001 --> 01:06:19,001 issues of publications in dozens of languages published all around the world, 842 01:06:19,001 --> 01:06:24,001 looking at the impact of computers and the emergence of the digital world we live in 843 01:06:24,001 --> 01:06:30,001 today. My initial exploration of this expanded collection is focused on the 844 01:06:30,001 --> 01:06:35,001 development of a ground truth dataset of computer magazine TOC, or table of 845 01:06:35,001 --> 01:06:42,000 contents, pages. TOCs serve as a Sudoku puzzle-like set of hints about the 846 01:06:42,000 --> 01:06:47,000 document structures of a magazine. This dataset will be invaluable for training 847 01:06:47,000 --> 01:06:52,001 machine learning models to help move digitization pipelines from within-page to 848 01:06:52,001 --> 01:06:58,000 whole-document layout recognition. My goal now is to stimulate collaboration 849 01:06:58,000 --> 01:07:03,001 between my EU and UK research friends and new partners from Stanford Libraries, 850 01:07:04,000 --> 01:07:09,000 its AI lab, the Computer History Museum, and the Internet Archive to forge a 851 01:07:09,000 --> 01:07:13,001 research consortium to further preserve and make accessible for scholarly 852 01:07:13,001 --> 01:07:18,000 research and public interest the computer magazines meta-collection at the 853 01:07:18,000 --> 01:07:23,000 Internet Archive. On behalf of myself and Timlynn Babitsky, thank you to the 854 01:07:23,000 --> 01:07:28,001 Archive and webinar organizers for inviting me to present this lightning talk. 855 01:07:30,001 --> 01:07:36,000 And thanks to you, Jim, for sharing your inspiring story, your video, your work, 856 01:07:36,001 --> 01:07:41,001 and for sponsoring the digitization of the materials that are now available to 857 01:07:41,001 --> 01:07:48,000 all at the Internet Archive. Your table of contents, or TOC, work is of 858 01:07:48,000 --> 01:07:53,000 high interest to us. I know also from my previous work with the Biodiversity 859 01:07:53,000 --> 01:07:57,001 Heritage Library that those tables of contents are really, that's where it's all 860 01:07:57,001 --> 01:08:01,000 at in terms of the structural metadata for the article. 861 01:08:01,001 --> 01:08:03,001 So we'll be following up to learn a 862 01:08:03,001 --> 01:08:05,000 little bit more about what you're doing there. 863 01:08:06,000 --> 01:08:12,001 So up next I'd like to welcome Emmanuel Tranos to the screen, who's going to tell 864 01:08:12,001 --> 01:08:18,001 us about the relationship that exists between the web and cities.
This is a 865 01:08:18,001 --> 01:08:21,001 really interesting talk. I know you're going to love it. So over to you, Emmanuel. 866 01:08:24,000 --> 01:08:31,000 Chris, thank you so much for this. Let me try to share my screen and also tell 867 01:08:31,000 --> 01:08:37,001 you how excited I am to be part of this very cool session. So my name is Emmanuel 868 01:08:37,001 --> 01:08:42,000 Tranos. I'm a reader in quantitative human geography at the University of Bristol 869 01:08:42,000 --> 01:08:47,001 and the Alan Turing Institute in the UK. And I guess I'm part of this rare breed 870 01:08:47,001 --> 01:08:52,001 of geographers who have a fascination with the Internet. So today I'm going to 871 01:08:52,001 --> 01:08:59,000 give you a brief overview of our research using data from the Internet Archive 872 01:08:59,000 --> 01:09:04,001 to understand the link between the web and cities. We're using such data to 873 01:09:04,001 --> 01:09:10,001 understand the early Internet but also other interesting geographies. So 874 01:09:10,001 --> 01:09:12,000 what data do we use? 875 01:09:13,001 --> 01:09:20,000 We're using data that is curated by the British Library here in the UK. This is a 876 01:09:20,000 --> 01:09:26,000 dataset called the JISC UK Web Domain Dataset. And this 877 01:09:26,000 --> 01:09:32,000 contains all the archived web pages from the Internet Archive 878 01:09:32,000 --> 01:09:38,001 under the .uk top-level domain during the 1996 to 2012 879 01:09:38,001 --> 01:09:44,001 period. The British Library did something very clever here. They scanned the 880 01:09:44,001 --> 01:09:51,001 web text of all these archived web pages and created a different, a 881 01:09:51,001 --> 01:09:58,000 further subset which only includes those archived web pages which contain 882 01:09:58,001 --> 01:10:04,000 a UK postcode within their web text. And I believe you can see the link to 883 01:10:04,000 --> 01:10:06,000 these datasets at the top. 884 01:10:06,001 --> 01:10:12,001 So we started our research with almost half a billion lines which look like 885 01:10:12,001 --> 01:10:19,001 this. We know that this archived URL contains within 886 01:10:19,001 --> 01:10:26,001 its web text this UK postcode. And this postcode refers to a very small area, 887 01:10:26,001 --> 01:10:32,000 almost a block, in the UK. And we also know the timestamp of this archival 888 01:10:32,000 --> 01:10:39,000 process. It happened on the 9th of September of 2008. So what do we do 889 01:10:39,000 --> 01:10:46,000 with this data? Firstly, we used this data to create a measure 890 01:10:46,000 --> 01:10:50,000 of the online content of local interest. 891 01:10:50,000 --> 01:10:57,000 And we matched this data with a large individual survey and were able 892 01:10:57,000 --> 01:11:03,001 to illustrate that the availability of online content of local interest 893 01:11:03,001 --> 01:11:10,000 actually attracts individuals online. We knew a lot about how 894 01:11:10,000 --> 01:11:14,001 individuals, you know, connect to the internet and spend more time 895 01:11:14,001 --> 01:11:19,000 online. But this was the first time we were able to say something about the 896 01:11:19,000 --> 01:11:24,000 actual factors that attract individuals to spend more time online. And, 897 01:11:24,000 --> 01:11:28,001 importantly, to do this at a very local scale. 898 01:11:29,001 --> 01:11:36,001 At a different scale, in a different study, we used such data in order to 899 01:11:36,001 --> 01:11:43,000 understand economic clusters.
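As an illustration of the kind of postcode filtering Emmanuel describes, here is a small sketch that keeps an archived page only if its text contains something that looks like a UK postcode. The regex is a common approximation and the records are hypothetical; this is not the British Library's actual pipeline.

```python
# Illustrative sketch: keep archived pages whose text contains a UK-postcode-like string.
import re

# Rough UK postcode pattern, e.g. "BS8 1SS" or "EC2A 4NE" (an approximation only)
POSTCODE = re.compile(r"\b[A-Z]{1,2}[0-9][0-9A-Z]?\s*[0-9][A-Z]{2}\b")

def postcodes_in(text):
    """Return the UK-postcode-looking strings found in a page's text."""
    return POSTCODE.findall(text.upper())

# Hypothetical (url, timestamp, page text) records standing in for archived pages
records = [
    ("http://example.co.uk/contact", "20080909123000", "Visit us at 1 High St, Bristol BS8 1SS"),
    ("http://example.co.uk/blog", "20080910080000", "No address on this page"),
]
local_content = [
    (url, ts, postcodes_in(text))
    for url, ts, text in records
    if postcodes_in(text)
]
print(local_content)
```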
And when I say economic clusters, I refer to these 900 01:11:43,000 --> 01:11:49,000 neighborhoods within cities that host very, you know, specific economic 901 01:11:49,000 --> 01:11:53,001 activities, these neighborhoods that are very specialized in specific economic activities. 902 01:11:54,000 --> 01:12:01,000 We focused on Shoreditch. This is a very well-known tech cluster here in London. And 903 01:12:01,000 --> 01:12:07,001 we used archived web data in order to map the evolution of 904 01:12:07,001 --> 01:12:14,001 this economic cluster over space and time. But importantly, to map the evolution 905 01:12:14,001 --> 01:12:20,001 of this cluster also in terms of the types of economic activities that took 906 01:12:20,001 --> 01:12:26,000 place, you know, within this cluster. And we were able, you know, to extract 907 01:12:26,000 --> 01:12:31,000 meaningful types of economic activities, much more detailed than official data 908 01:12:31,000 --> 01:12:33,000 would have enabled us to do. 909 01:12:35,000 --> 01:12:40,000 Again, we're changing the scale of analysis, and we're moving, you know, from a 910 01:12:40,000 --> 01:12:47,000 small neighborhood in London to the whole of the UK. We 911 01:12:47,000 --> 01:12:53,001 employ such archived web data in order to test the economic 912 01:12:53,001 --> 01:13:00,001 effects that the early adoption of web technologies can generate for regions 913 01:13:00,001 --> 01:13:07,000 here in the UK. So we're able to build, you know, measures of the volume 914 01:13:07,000 --> 01:13:14,000 of online commercial content back from 2000. And 915 01:13:14,000 --> 01:13:20,001 we're able to link, you know, these measures from 2000 to specific regions 916 01:13:20,001 --> 01:13:27,000 within the UK. We then utilize econometric techniques, and we're able to 917 01:13:27,000 --> 01:13:34,000 illustrate an interesting lesson. The volume of online content back from 2000, 918 01:13:34,000 --> 01:13:40,000 which to us represents the early adoption of web technologies, is actually 919 01:13:40,000 --> 01:13:46,000 associated with positive economic productivity effects. And these productivity 920 01:13:46,000 --> 01:13:51,001 effects are long-lasting. They are long-term, you know, positive effects that 921 01:13:51,001 --> 01:13:58,001 regions which adopted this technology early 922 01:13:58,001 --> 01:14:03,001 on are able to enjoy for longer, for quite lengthy time periods. 923 01:14:05,000 --> 01:14:12,000 And last but not least, in a similar vein, we used such archived web 924 01:14:12,000 --> 01:14:18,000 data, and more specifically, HTML links between commercial websites, 925 01:14:18,000 --> 01:14:25,000 to predict trade between different UK regions. 926 01:14:25,001 --> 01:14:31,000 More specifically, we're able to make out-of-sample predictions using machine 927 01:14:31,000 --> 01:14:37,001 learning algorithms regarding these regional trade flows. Why is this important? 928 01:14:38,000 --> 01:14:45,000 Because there is hardly any data for trade between regions at such a small scale. 929 01:14:45,001 --> 01:14:49,001 So by using these freely available, you know, data that the Internet Archive, 930 01:14:49,001 --> 01:14:55,000 you know, collects for all of us, we're able to make predictions for certain 931 01:14:55,000 --> 01:15:01,000 important policy elements and have local authorities utilize these data that we 932 01:15:01,000 --> 01:15:08,000 generated.
So all in all, using these freely available data, we're able to 933 01:15:08,000 --> 01:15:13,000 map the evolution and the geography of engagement with the Internet, 934 01:15:13,001 --> 01:15:18,000 especially in its early stages. And trust me, there is hardly any other data, you 935 01:15:18,000 --> 01:15:21,000 know, that can go so far back in time and is also so granular. 936 01:15:21,001 --> 01:15:28,000 By doing that, we're able to draw important lessons regarding the deployment of 937 01:15:28,000 --> 01:15:34,000 other future technologies. And also, we're able to understand economic activities 938 01:15:34,000 --> 01:15:40,000 that take place within and between cities at a very detailed level, both in terms of space and time, but also in terms of 939 01:15:40,000 --> 01:15:46,001 context. All of this research 940 01:15:46,001 --> 01:15:52,001 can be found in various papers we have published, openly accessible as well, that 941 01:15:52,001 --> 01:15:58,000 you can find on my website. Again, thank you so much for having me here today. 942 01:16:00,000 --> 01:16:06,000 Thank you, Emmanuel, for sharing your research. We do have a question for you and 943 01:16:06,000 --> 01:16:12,000 for Spencer, but we want to wrap up today with a final video from our session. So 944 01:16:12,000 --> 01:16:18,000 we're going to welcome Tom Galli back to the screen virtually, and he's going to 945 01:16:18,000 --> 01:16:24,000 share his research and his work into the forgotten novels of the 19th century. 946 01:16:28,000 --> 01:16:32,001 I started reading old books back in the 1960s when I was still a child. 947 01:16:33,001 --> 01:16:39,001 So how did I decide what books to read? Well, we had a lot of books at home, and 948 01:16:39,001 --> 01:16:45,000 I read some of those. There was a small public library nearby, and I liked to 949 01:16:45,000 --> 01:16:51,000 browse the shelves there too. And when I got my allowance, I would go to the 950 01:16:51,000 --> 01:16:56,000 local bookstore, look through those shelves, and maybe buy a couple of 951 01:16:56,000 --> 01:16:57,001 paperbacks that caught my eye. 952 01:16:58,000 --> 01:17:02,000 The books I happened to see on those shelves would shape my reading 953 01:17:02,000 --> 01:17:03,001 choices in the years to come. 954 01:17:05,000 --> 01:17:09,001 When I started reading 19th century novels, I naturally gravitated towards 955 01:17:09,001 --> 01:17:16,000 authors and books that I had seen on those shelves. Charles Dickens, Jane Austen, 956 01:17:16,000 --> 01:17:22,000 Nathaniel Hawthorne, Crime and Punishment, The Adventures of Huckleberry Finn. In 957 01:17:22,000 --> 01:17:24,000 other words, the classics. 958 01:17:24,000 --> 01:17:30,001 It was only later, after I had access to a large university library, 959 01:17:31,000 --> 01:17:36,000 that I discovered the vast number of other novels published in the 19th century. 960 01:17:37,000 --> 01:17:42,001 Their pages were yellow and brittle, and most had never been reprinted. But after 961 01:17:42,001 --> 01:17:48,000 I graduated from college, I could no longer access those books, and I was stuck 962 01:17:48,000 --> 01:17:49,001 with the classics again. 963 01:17:51,000 --> 01:17:57,001 So, fast forward to around the year 2010. Libraries around the world were now 964 01:17:57,001 --> 01:18:02,000 scanning their old books, and the Internet Archive was making those books 965 01:18:02,000 --> 01:18:04,000 available online for free.
966 01:18:05,000 --> 01:18:10,001 Now, anyone in the world could read all of those old novels, including the 967 01:18:10,001 --> 01:18:15,000 thousands that publishers had not reprinted and marketed as classics. 968 01:18:16,000 --> 01:18:22,000 So, in 2021, just for fun, I compiled a list of 19th century 969 01:18:22,000 --> 01:18:24,000 novels at the Internet Archive. 970 01:18:24,001 --> 01:18:30,001 I chose only novels that nobody seemed to be reading anymore, the once popular 971 01:18:30,001 --> 01:18:34,000 fiction that had been overlooked by the classics industry. 972 01:18:35,000 --> 01:18:40,001 You can find that list on the Internet Archive's blog under the title, Forgotten 973 01:18:40,001 --> 01:18:45,000 Novels of the 19th Century. I've enjoyed dipping into those 974 01:18:45,000 --> 01:18:47,001 books. Maybe you will too. 975 01:18:52,000 --> 01:18:58,001 So, we'll share that link out to Tom's blog post for everyone in the email follow 976 01:18:58,001 --> 01:19:04,000 -up, and I'm sure you'll find some new things that you haven't read for a while. 977 01:19:04,001 --> 01:19:09,000 Thanks, Duncan, for sharing that out. I would like to bring Spencer and Emmanuel 978 01:19:09,000 --> 01:19:14,000 back to the screen for a couple of questions, and the first one's for Spencer. 979 01:19:14,001 --> 01:19:18,001 It's, have there been changes in how Twitter users use hashtags over time, 980 01:19:18,001 --> 01:19:23,001 especially more recently? That's a really good question. We haven't looked into 981 01:19:23,001 --> 01:19:28,000 it in that respect yet, but that is on the docket. So, you know, again, I'll 982 01:19:28,000 --> 01:19:34,000 reference the Minnesota example, where prior to May 2020 it was all geography, I mean, 983 01:19:34,000 --> 01:19:38,001 having to do with states, capitals, right? No mention of civil rights or anything, 984 01:19:39,000 --> 01:19:43,000 and then after May 2020, very much non-geographical. Well, incidentally 985 01:19:43,000 --> 01:19:47,000 geographical, having to do with Black Lives Matter and the murder of George 986 01:19:47,000 --> 01:19:49,001 Floyd. So, that's an example, but it's a great question 987 01:19:49,001 --> 01:19:51,001 and something we intend to dive into. 988 01:19:52,000 --> 01:19:59,000 Thank you for that, and a follow-up question for Emmanuel. Is your research 989 01:19:59,000 --> 01:20:03,001 methodology extensible to other geographic areas, or would you need to change 990 01:20:03,001 --> 01:20:09,001 your approach for studying areas outside the UK? No, it is absolutely extendable 991 01:20:09,001 --> 01:20:13,000 and replicable outside of the UK. 992 01:20:13,001 --> 01:20:19,001 The main difference is having, you know, available, you know, data sets or 993 01:20:19,001 --> 01:20:23,000 subsets, you know, from the Internet Archive, and actually this is one of the values 994 01:20:23,000 --> 01:20:29,000 of this data, because these are openly available data sources, and by using these 995 01:20:29,000 --> 01:20:34,000 data, we can actually outperform some of the official data sources. 996 01:20:36,000 --> 01:20:41,000 That's fascinating. I know that there are some additional questions, but I'm also 997 01:20:41,000 --> 01:20:47,001 sensitive about our time. So, what I'm going to do is have 998 01:20:47,001 --> 01:20:51,000 everyone share their contact information.
So, if you have additional 999 01:20:51,000 --> 01:20:56,001 questions for any of our presenters today, maybe we can follow up offline or 1000 01:20:56,001 --> 01:21:03,001 after the session. So, thank you, Spencer and Emmanuel, for your conversations 1001 01:21:03,001 --> 01:21:09,000 today and for taking some time to field some questions. Thank you so much, and also 1002 01:21:09,000 --> 01:21:15,001 thanks to Jim and to Tom for the videos that they offered up, Jim and Timlynn as 1003 01:21:15,001 --> 01:21:22,000 well. So, here we are at the end. We've made it. I want to give 1004 01:21:22,000 --> 01:21:27,000 an acknowledgement to some of the people who helped guide this series from its 1005 01:21:27,000 --> 01:21:32,001 start, from inception, and we pulled together an advisory group for our series, 1006 01:21:32,001 --> 01:21:38,001 and that included some real heavy hitters in the digital humanities space. So, a 1007 01:21:38,001 --> 01:21:43,001 big thank you to Dan Cohen from Northeastern, to Makiba Foster from Broward 1008 01:21:43,001 --> 01:21:49,001 County Library, to Mike Furlough from HathiTrust, and to Harriet Green at Washington 1009 01:21:49,001 --> 01:21:55,001 University in St. Louis. They've really helped shape the talks that we saw here 1010 01:21:55,001 --> 01:22:00,000 today and helped give guidance on how we should frame a conversation, a long-tail 1011 01:22:00,000 --> 01:22:05,000 conversation, a longitudinal chat, if you will, among our digital humanities 1012 01:22:05,000 --> 01:22:12,000 scholars. So, thanks to the panelists. I'm 1013 01:22:12,000 --> 01:22:16,001 going to go a little off script. I don't know if Brewster is still on the call, I 1014 01:22:16,001 --> 01:22:22,000 think that he is, but I would like to offer Brewster a chance, if he's still 1015 01:22:22,000 --> 01:22:27,000 available and if he's near his computer, to come on screen and give us a little 1016 01:22:27,000 --> 01:22:34,000 wrap-up. What do you think of everything that you've seen here today and across 1017 01:22:34,000 --> 01:22:41,000 the series? Oh, this is so inspiring and so fun, and it's just great to be able to 1018 01:22:41,000 --> 01:22:46,001 see the light and humorous nature of this, as well as answering these 1019 01:22:46,001 --> 01:22:52,000 real questions. Of course, what springs to mind is that we have more of this to 1020 01:22:52,000 --> 01:22:58,000 share with you. We have periodicals that have those electronic data processing 1021 01:22:58,000 --> 01:23:00,000 schools and institutes. 1022 01:23:00,001 --> 01:23:06,001 Anyway, I'm so glad that this is going on. I think the lightning 1023 01:23:06,001 --> 01:23:08,000 format at least works for me. 1024 01:23:09,000 --> 01:23:12,001 So, thank you very much for pulling this together, Chris and everyone. 1025 01:23:13,001 --> 01:23:17,001 Yeah, thanks. I think this, the lightning format, it's nice to have the variety, 1026 01:23:17,001 --> 01:23:22,001 right? You can do some deep dives on some individual topics and then get a broad 1027 01:23:22,001 --> 01:23:26,001 overview of the wealth of materials and the wealth of research that's happening 1028 01:23:26,001 --> 01:23:30,001 in and around the collections at the Internet Archive. 1029 01:23:31,000 --> 01:23:34,001 What I would say as we wind down here today is this is the start of a 1030 01:23:34,001 --> 01:23:38,000 conversation. Let's think of this as the start of a conversation, not as the end 1031 01:23:38,000 --> 01:23:42,000 of one with this series.
Part of what we wanted to do in bringing this together 1032 01:23:42,000 --> 01:23:46,001 was to, one, raise awareness that there are digital humanities projects that 1033 01:23:46,001 --> 01:23:50,000 are actively using the collections and the infrastructure at the Internet 1034 01:23:50,000 --> 01:23:53,001 Archive. And so what we were hoping was to bring some of those people together, 1035 01:23:54,000 --> 01:23:58,001 provide some visibility into the work that everyone is doing, and start that 1036 01:23:58,001 --> 01:24:04,000 conversation so that we can do more together. So, to wind down here today, on 1037 01:24:04,000 --> 01:24:08,001 behalf of the Internet Archive and our presenters today, I'd like to thank you 1038 01:24:08,001 --> 01:24:13,000 all for your time and your participation, and a big thanks to everyone who's 1039 01:24:13,000 --> 01:24:17,000 joined in the multiple sessions that we've had across this series. Thanks, 1040 01:24:17,000 --> 01:24:18,001 everyone. Have a great day.